Master’s Thesis: A Tuning Approach Based on Evolutionary Algorithm and Data Sampling for Boosting Performance of MapReduce Programs
Abstract
Apache Hadoop data processing software operates in a complex environment composed of huge machine clusters, large data sets, and many concurrent processing jobs. Managing a Hadoop environment is time-consuming and laborious, and it requires expert users. A lack of expertise can lead to misconfigurations that degrade cluster performance, and users end up spending much of their time tuning the system instead of analyzing data. To address misconfiguration, we propose a solution implemented on top of Hadoop: a self-tuning mechanism for Hadoop jobs in Big Data environments. The mechanism rests on two key ideas: (1) an evolutionary algorithm that generates and tests new job configurations, and (2) data sampling that reduces the cost of the self-tuning process. From these ideas we built a framework that tests common job configurations and derives a new configuration suited to the current state of the environment. Experimental results show gains in job performance over both Hadoop's default configuration and the usual rules of thumb. The experiments also demonstrate the accuracy of our solution, defined here as the relation between the cost of obtaining a better configuration and the quality of the configuration reached.
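The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the algorithm's details, so the search ranges, the genetic operators, and the function names (`sample_records`, `simulated_runtime`, `tune`) are all assumptions made for illustration. In particular, `simulated_runtime` is a synthetic cost formula standing in for actually running the job on the sampled data and timing it. The three property names are standard Hadoop MapReduce settings, used here only as example tuning knobs.

```python
import random

# Hypothetical search space: property name -> (low, high) integer range.
# The ranges are illustrative assumptions, not recommendations.
SEARCH_SPACE = {
    "mapreduce.job.reduces": (1, 64),
    "mapreduce.task.io.sort.mb": (50, 1024),
    "mapreduce.task.io.sort.factor": (5, 100),
}


def sample_records(records, fraction, rng):
    """Idea (2): draw a uniform random sample so each trial run is cheap."""
    k = max(1, int(len(records) * fraction))
    return rng.sample(records, k)


def simulated_runtime(config, sample):
    """Synthetic stand-in for running the job on the sample and timing it."""
    n = len(sample)
    reduces = config["mapreduce.job.reduces"]
    sort_mb = config["mapreduce.task.io.sort.mb"]
    factor = config["mapreduce.task.io.sort.factor"]
    return n / reduces + 0.5 * reduces + 1000.0 / sort_mb + 0.1 * abs(factor - 40)


def random_config(rng):
    return {name: rng.randint(lo, hi) for name, (lo, hi) in SEARCH_SPACE.items()}


def crossover(a, b, rng):
    """Uniform crossover: each property comes from one of the two parents."""
    return {name: (a if rng.random() < 0.5 else b)[name] for name in SEARCH_SPACE}


def mutate(config, rng, rate=0.3):
    """Re-draw each property with probability `rate`."""
    child = dict(config)
    for name, (lo, hi) in SEARCH_SPACE.items():
        if rng.random() < rate:
            child[name] = rng.randint(lo, hi)
    return child


def tune(records, fraction=0.1, pop_size=12, generations=15, seed=0):
    """Idea (1): evolve configurations, scoring each one on the sample only."""
    rng = random.Random(seed)
    sample = sample_records(records, fraction, rng)

    def cost(cfg):
        return simulated_runtime(cfg, sample)

    population = [random_config(rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=cost)
        parents = population[: pop_size // 2]  # the better half survives unchanged
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    best = min(population, key=cost)
    return best, cost(best)


if __name__ == "__main__":
    best, best_cost = tune(list(range(10_000)))
    print(best, round(best_cost, 2))
```

Because the better half of each generation is carried over unchanged, the best cost found never gets worse from one generation to the next. In a real setting the cost function would submit the job with the candidate configuration against the sampled input, which is where sampling pays off: every candidate evaluation costs a fraction of a full run.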
Similar resources
Proposing a Novel Cost Sensitive Imbalanced Classification Method based on Hybrid of New Fuzzy Cost Assigning Approaches, Fuzzy Clustering and Evolutionary Algorithms
In this paper, a new hybrid methodology is introduced to design a cost-sensitive fuzzy rule-based classification system. A novel cost metric is proposed based on the combination of three different concepts: Entropy, Gini index and DKM criterion. In order to calculate the effective cost of patterns, a hybrid of fuzzy c-means clustering and particle swarm optimization algorithm is utilized. This ...
Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments
Hadoop MapReduce framework is an important distributed processing model for large-scale data intensive applications. The current Hadoop and the existing Hadoop distributed file system’s rack-aware data placement strategy in MapReduce in the homogeneous Hadoop cluster assume that each node in a cluster has the same computing capacity and a same workload is assigned to each node. Default Hadoop d...
Fraud Detection of Credit Cards Using Neuro-fuzzy Approach Based on TLBO and PSO Algorithms
The aim of this paper is to detect bank credit cards related frauds. The large amount of data and their similarity lead to a time consuming and low accurate separation of healthy and unhealthy samples behavior, by using traditional classifications. Therefore in this study, the Adaptive Neuro-Fuzzy Inference System (ANFIS) is used in order to reach a more efficient and accurate algorithm. By com...
GENERALIZED FLEXIBILITY-BASED MODEL UPDATING APPROACH VIA DEMOCRATIC PARTICLE SWARM OPTIMIZATION ALGORITHM FOR STRUCTURAL DAMAGE PROGNOSIS
This paper presents a new model updating approach for structural damage localization and quantification. Based on the Modal Assurance Criterion (MAC), a new damage-sensitive cost function is introduced by employing the main diagonal and anti-diagonal members of the calculated Generalized Flexibility Matrix (GFM) for the monitored structure and its analytical model. Then, ...
Verification of an Evolutionary-based Wavelet Neural Network Model for Nonlinear Function Approximation
Nonlinear function approximation is one of the most important tasks in system analysis and identification. Several models have been presented to achieve an accurate approximation on nonlinear mathematics functions. However, the majority of the models are specific to certain problems and systems. In this paper, an evolutionary-based wavelet neural network model is proposed for structure definiti...